23/03/2025 - 29/03/2025

24/03/2025 17:31

Here is my "simplified" software diagram for the ATAR DAQ:

graph TD;
    NaluFrontend["<b><a href='https://github.com/PIONEER-Experiment/atar_daq'>Nalu MIDAS Frontend</a></b><br>Coordinates MIDAS event construction"]

    subgraph Libraries
        NaluBoardLib["<b><a href='https://github.com/jaca230/nalu_board_controller'>Nalu Board Controller</a></b><br>C++ Wrapper around naludaq methods for configuring the board and starting readout"]
        NaluEventCollectorLib["<b><a href='https://github.com/jaca230/nalu_event_collector'>Nalu Event Collector</a></b><br>C++ API for launching collector threads. Handles receiving data over UDP, processing packets, and collecting into NaluEvents"]
        MidasLib["<b><a href='https://bitbucket.org/tmidas/midas/src/develop/'>MIDAS</a></b><br>Data acquisition framework"]
        ReflectCppLib["<b><a href='https://github.com/getml/reflect-cpp'>reflect-cpp</a></b><br>C++ reflection library used for serialization"]
    end

    subgraph Python Packages
        NaludaqPython["<b><a href='https://pypi.org/project/naludaq/0.31.9/'>naludaq</a></b><br>Python interface for Naludaq"]
    end

    subgraph Classes
        NaluBoardController["<b><a href='https://github.com/jaca230/nalu_board_controller/blob/main/include/nalu_board_controller.h'>nalu_board_controller</a></b><br>Provides methods for configuring the board and starting readout"]
        NaluEventCollector["<b><a href='https://github.com/jaca230/nalu_event_collector/blob/main/include/nalu_event_collector.h'>nalu_event_collector</a></b><br>Provides methods for starting collector threads and polling for events"]
        OdbManager["<b><a href='https://github.com/PIONEER-Experiment/atar_daq/blob/main/include/odb_manager.h'>odb_manager</a></b><br>Handles initializing and managing ODB structure for Nalu Equipment"]
        MidasFrontend["<b><a href='https://bitbucket.org/tmidas/midas/src/develop/include/mfe.h'>mfe</a></b><br>Handles MIDAS frontend logic"]
    end

    %% Connect libraries to the PythonPackages layer
    NaluBoardLib -->|Pybind| NaludaqPython

    %% Connect libraries to the Classes layer
    NaluFrontend -->|Uses| NaluBoardLib
    NaluFrontend -->|Uses| NaluEventCollectorLib
    NaluFrontend -->|Uses| MidasLib
    NaluFrontend -->|Uses| ReflectCppLib
    NaluBoardLib -->|Provides| NaluBoardController
    NaluEventCollectorLib -->|Provides| NaluEventCollector
    MidasLib -->|Provides| MidasFrontend
    ReflectCppLib -->|Used By| OdbManager

9b5a56168930ce8e3f01ec37cd955568.png


25/03/2025 13:35

I identified "problematic" rate test parameters with this script

# Step 1: First filter based on Expected Data Rate
df_filtered_initial = df[df['Expected Data Rate (KB/s)'] < 55000].copy()

# Step 2: Define filtering conditions on this subset
condition_1 = df_filtered_initial['Collector Error'].notna() & (df_filtered_initial['Collector Error'] != "None")
condition_2 = ~df_filtered_initial['kBytes per sec'].div(df_filtered_initial['Expected Data Rate (KB/s)']).between(0.8, 1.4)
condition_3 = ~df_filtered_initial['Frequency (Hz)'].div(df_filtered_initial['Data Rate (Events per sec)']).between(0.9, 1.1)
condition_4 = df_filtered_initial['Frequency (Hz)'] > 1000  # Frequency must be above 1 kHz

# Step 3: Create a reason column to track which conditions were met
df_filtered_initial['Reason'] = ''

df_filtered_initial.loc[condition_1, 'Reason'] += 'Collector Error; '
df_filtered_initial.loc[condition_2, 'Reason'] += 'Data Rate Mismatch; '
df_filtered_initial.loc[condition_3, 'Reason'] += 'Frequency/Data Rate Mismatch; '

# Step 4: Apply the additional filtering conditions
filtered_df = df_filtered_initial[(condition_1 | condition_2 | condition_3) & condition_4].copy()

# Display row count and first few rows for verification
print(f"Filtered DataFrame has {filtered_df.shape[0]} rows.")
filtered_df[['File', 'Frequency (Hz)', 'Data Rate (Events per sec)', 
             'Windows', 'Events Sent', 'kBytes per sec', 
             'Active Channels Length', 'Expected Data Rate (KB/s)', 
             'Collector Error', 'Reason']]


# Define the output file path
output_file = "filtered_data.txt"

# Open the file and write each row in the specified format
with open(output_file, "w") as f:
    for _, row in filtered_df.iterrows():
        frequency = int(row['Frequency (Hz)'])
        windows = int(row['Windows'])
        channels = int(row['Active Channels Length'])
        computed_value = frequency * windows * channels
        
        # Format the line as: 0 0 0 {frequency} {windows} {channels} {computed_value}
        f.write(f"0 0 0 {frequency} {windows} {channels} {computed_value}\n")

print(f"Filtered data has been written to {output_file}")

So my criteria are:

  1. The expected data rate must be below 55 MB/s.

    1. This is the limit the board can output, so we don't expect good performance above this.
  2. Must be over a 1 kHz trigger rate.

    1. I only make this cut because the "expected data rate" calculation is poor at low event rates. Without this cut, I get a lot of data points that weren't actually problematic.
  3. The "normalized" data rate is outside the range [0.8, 1.4].

    1. So we differ from the expected data rate by a meaningful percentage.
  4. The "normalized" event rate is outside the range [0.9, 1.1].

    1. So we differ from the expected event rate (the external trigger rate) by a meaningful percentage.
  5. The run contained a collector error.

A run is flagged when it passes cuts 1 and 2 and meets at least one of 3, 4, or 5.


25/03/2025 13:44

Using the above criteria, I created a parameter space for the sequencer to step through. For each point in the parameter space I did a 1-minute-long run, sampling MIDAS's measured data rate and event rate every 4 seconds. So each "problematic parameter set" has 15 sequential samples.
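
The parameter file written by the script above (lines of the form `0 0 0 {frequency} {windows} {channels} {computed_value}`) can be read back into parameter tuples like this. This is just a sketch of the bookkeeping; the sequencer's actual input handling may differ:

```python
def read_parameter_file(path="filtered_data.txt"):
    """Parse lines of the form '0 0 0 {frequency} {windows} {channels} {computed}'."""
    params = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            # First three fields are the fixed '0 0 0' prefix; last is the derived product.
            _, _, _, frequency, windows, channels, _ = map(int, fields)
            params.append((frequency, windows, channels))
    return params

# Each parameter set gets a 60 s run sampled every 4 s -> 15 samples per run.
RUN_SECONDS, SAMPLE_PERIOD = 60, 4
samples_per_run = RUN_SECONDS // SAMPLE_PERIOD
print(samples_per_run)  # → 15
```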

25/03/2025 13:43

After collecting this data, I computed means and uncertainties like so:

import numpy as np

# Function to compute Time to Stability
def time_to_stability(data_rates, tolerance=0.01, window=3):
    """Returns the first index where the data rate stabilizes within tolerance of the final value for a given run."""
    final_value = data_rates.iloc[-1]  # Assume the last value is steady-state
    threshold = final_value * (1 - tolerance)  # Define stability threshold
    
    for i in range(len(data_rates) - window + 1):  # +1 so the last full window is checked
        if np.all(data_rates.iloc[i:i+window] >= threshold):
            return i  # First index where stability is reached
    
    return len(data_rates)  # If it never stabilizes, return the full length

def count_collector_errors(errors):
    """Counts the number of non-'None' and non-'N/A' collector errors."""
    return errors[~errors.isin(['None', 'N/A'])].count()


# Define the aggregation functions for each column
agg_funcs = {
    'Frequency (Hz)': 'mean',  # No uncertainty needed
    'Data Rate (Events per sec)': [
        'mean',  # Mean
        lambda x: np.std(x) / np.sqrt(len(x)),  # Uncertainty (standard error)
        time_to_stability  # Compute stability time
    ],
    'Windows': 'mean',  # No uncertainty needed
    'Events Sent': 'max',  # Take the maximum value
    'kBytes per sec': [
        'mean',  # Mean
        lambda x: np.std(x) / np.sqrt(len(x))  # Uncertainty (standard error)
    ],
    'Active Channels Length': 'mean',  # No uncertainty needed
    'Expected Data Rate (KB/s)': 'mean',  # No uncertainty needed
    'Collector Error': count_collector_errors  # Count occurrences of actual errors
}

# Perform the groupby aggregation
consolidated_df = df.groupby('Run Number').agg(agg_funcs)

# Rename the columns for clarity
consolidated_df.columns = [
    'Avg Frequency (Hz)',
    'Avg Data Rate (Events per sec)', 'Uncertainty Data Rate', 'Time to Stability',
    'Avg Windows',
    'Max Events Sent',
    'Avg kBytes per sec', 'Uncertainty kBytes per sec',
    'Avg Active Channels Length',
    'Avg Expected Data Rate (KB/s)',
    'Collector Error Count'
]

# Compute the normalized values and their uncertainties
consolidated_df['Normalized Frequency'] = consolidated_df['Avg Data Rate (Events per sec)'] / consolidated_df['Avg Frequency (Hz)']
consolidated_df['Uncertainty Normalized Frequency'] = consolidated_df['Uncertainty Data Rate'] / consolidated_df['Avg Frequency (Hz)']

consolidated_df['Normalized kBytes per sec to Expected Data Rate'] = consolidated_df['Avg kBytes per sec'] / consolidated_df['Avg Expected Data Rate (KB/s)']
consolidated_df['Uncertainty Normalized kBytes per sec to Expected Data Rate'] = consolidated_df['Uncertainty kBytes per sec'] / consolidated_df['Avg Expected Data Rate (KB/s)']

# Reset index for better readability
consolidated_df.reset_index(inplace=True)

# Display the consolidated DataFrame
consolidated_df

I also computed 2 "new" metrics:

  1. Number of collector errors
    1. This is just how many of the samples in a run had a collector error.
    2. If sample X in a run has a collector error, all subsequent samples in that run will report having a collector error (i.e. errors are only cleared at the start of a new run).
    3. This means it gives a metric for how long a run lasted before seeing a collector error.
  2. Time to Stability
    1. This is how many samples in a run it took before the data rate stabilized.
    2. Mathematically, if we take N samples per run, this is the smallest index i such that \forall j, \; i \leq j \leq N: \text{DataRate}[j] \geq 0.99 \cdot \text{DataRate}[N]
      1. Really I should also require \text{DataRate}[j] \leq 1.01 \cdot \text{DataRate}[N], but I didn't, and I don't imagine it has much effect on the metric result shown below.
      2. I retroactively added the \text{DataRate}[j] \leq 1.01 \cdot \text{DataRate}[N] condition. See plots below.
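
The retroactively added two-sided check can be sketched as follows (the helper name is mine; it uses the same tolerance and window defaults as the one-sided function above):

```python
import numpy as np
import pandas as pd

def time_to_stability_two_sided(data_rates, tolerance=0.01, window=3):
    """First index i such that `window` consecutive samples starting at i
    all lie within +/- tolerance of the final (steady-state) value."""
    final_value = data_rates.iloc[-1]
    lower = final_value * (1 - tolerance)
    upper = final_value * (1 + tolerance)

    for i in range(len(data_rates) - window + 1):
        chunk = data_rates.iloc[i:i + window]
        if np.all((chunk >= lower) & (chunk <= upper)):
            return i
    return len(data_rates)  # never stabilized

# Example: the rate ramps up, overshoots, then settles near 100.
rates = pd.Series([50, 80, 103, 100, 99.5, 100.2, 100])
print(time_to_stability_two_sided(rates))  # → 3 (the one-sided check would return 2, missing the overshoot)
```

The difference matters exactly for runs that overshoot the steady-state rate before settling, which the one-sided check counts as already stable.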

25/03/2025 13:58

Here are the plots from the last round of analysis, just for comparison purposes. They don't give much insight into what's causing the lower data rates:

f0e4404bacfa0902695d028eb2b35a40.png
ffdd8fec63d718741090f10e128fd799.png
c5b444b148e5de3c9d6b4bf00498c9a2.png
2cd561bc36625fba029cc4c7e308b88f.png
6fa9c7123eee3dba4fb41ef8c3a64072.png
51fb6aa1af703643d12efb3a1f10d686.png
4bde8b14064aacc98a00cbfce801f42f.png
Note: The reason some of the data points are outside the 55 MB/s expected data rate range, despite the cut I made to exclude them earlier, is as follows:
When I made the cut, I used this data rate calculation:
\text{Data Rate (B/s)} \approx \text{Trigger rate} \cdot (\text{N}_\text{channels} \cdot \text{N}_\text{windows} \cdot (\text{Packet Length = 80 bytes}) + (\text{Event Header+Footer = 28 bytes}))
However, this is a mistake; it should be:
\text{Data Rate (B/s)} \approx \text{Trigger rate} \cdot (\text{N}_\text{channels} \cdot \text{N}_\text{windows} \cdot (\text{Packet Length = 80 bytes}) + (\text{Event Header+Footer = 34 bytes}) + (\text{Timing Data = 64 bytes}))
So our cut was off by a little bit.
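
As a sanity check, the size of that correction can be computed directly. The constants come from the formulas above; the trigger rate, channel, and window values here are arbitrary examples:

```python
def expected_data_rate_bps(trigger_rate_hz, n_channels, n_windows,
                           packet_len=80, overhead=34 + 64):
    """Expected data rate in bytes/s: per-event packet payload plus fixed per-event overhead."""
    return trigger_rate_hz * (n_channels * n_windows * packet_len + overhead)

# The old (wrong) cut used overhead = 28 bytes; corrected is 34 + 64 = 98 bytes.
old = expected_data_rate_bps(10_000, 8, 4, overhead=28)
new = expected_data_rate_bps(10_000, 8, 4)
print(old, new, new - old)  # the difference is trigger_rate * 70 bytes/s
```

So the cut underestimated the rate by 70 bytes per event, a small fraction of the per-event payload at large window/channel counts, but enough to push borderline points past 55 MB/s.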

7311e22e5da4c9c57955a95befb5f11e.png


25/03/2025 14:05

Here are some "new" plots that are a bit more telling:

Collector error analysis:
c705389407557ee9cc63db62f3494782.png

We can see how collector errors depend on our parameters. For some reason, they were very prevalent with 2 channels. But the thing most correlated with collector errors is the expected data rate.

Time to stability analysis

Using just the lower bound, \text{DataRate}[j] \geq 0.99 \cdot \text{DataRate}[N]:
bef431ffaa8ee126560a53dd0ddf3ec5.png

With both bounds, 0.99 \cdot \text{DataRate}[N] \leq \text{DataRate}[j] \leq 1.01 \cdot \text{DataRate}[N]:
bbf78552e1d627551d2a380911f31ee3.png

We can see how the time for the data rate to stabilize depends on our parameters. Again, it's most correlated with the expected data rate.

Average data rate with uncertainty vs parameters
0aa5eb949fc0b49377b9432ecba3ae8c.png

We can see how the data rate depends on our parameters. We see artifacts of "skipped" events in there.

Normalized average data rate with uncertainty vs parameters
f0c3146558e5d9ebff55454fbf4f1993.png

Same plot as above, just normalized as \frac{\text{average data rate}}{\text{expected data rate}}. This gives insight into which parameters are problematic. However, this "expected data rate" calculation has issues, because I don't know how to correctly account for all the data going into MIDAS. I.e. the logger logs some additional data (such as bank names, indices, etc.) that skews the actual result upwards. As a result, I believe a lot of these events are actually "normal"; they were just picked out by my cuts above due to this lack of understanding.

Normalized average event rate with uncertainty vs parameters
0fba32ad943cb3899172c8f4d280ec99.png

This is similar to the plot above, except we're plotting the normalized event rate, i.e. \frac{\text{average event rate}}{\text{input trigger frequency}}. I believe this plot is slightly more telling than the one above, because we really expect every one of these data points to be on the red dotted line; otherwise we're missing events. Overall, it's unclear what's causing events to be missed. It's most correlated with the expected data rate.


28/03/2025 16:22

I did a longer run where I don't just look at the failure modes, to get more data on the "working modes". What I find is there are actually more errors than expected. I.e. the 4-second tests did not reveal as many errors as the 60-second tests. I suspect this may have to do with the choice of the collector's "time_threshold" parameter. If it's not set properly, the collector can fill up. Below are some plots (very similar to above).
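
A toy back-of-the-envelope for that suspicion (all numbers here are hypothetical, and the real collector's buffering is more complex): if events arrive slightly faster than they are drained, the backlog grows roughly linearly, so a short test can end before the buffer ever fills while a longer run hits the limit.

```python
def seconds_until_full(event_rate_hz, drain_rate_hz, buffer_capacity_events):
    """Toy model: backlog grows at (arrival - drain) events/s.
    Returns time in seconds until the buffer fills, or None if it never fills."""
    net_rate = event_rate_hz - drain_rate_hz
    if net_rate <= 0:
        return None  # the collector keeps up indefinitely
    return buffer_capacity_events / net_rate

# Hypothetical numbers: 10 kHz in, 9.5 kHz drained, 30k-event buffer.
print(seconds_until_full(10_000, 9_500, 30_000))  # → 60.0, visible in 60 s runs but not 4 s tests
```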

8114bc3799011f4c559c659e153686ed.png
5a9b9c413a5d8c1304445b99fff8ac59.png
c37816919a5bab2b186c88b3accead02.png
19408f25e8c83acf31b87e885a00e180.png
9d3026fc9774598ce393adc7d022f3e2.png
6547094f660ac271f51b6dd0e9841306.png
cf4f912b04810035362852cc6b8f8cc7.png
fa1f54b4eff24e624310a57cc4a41c1f.png
26b539cc81176e41a43072add524af9a.png
93d9af08f3694c4333fe378502e225b6.png
1aead2f8abd592c2a6934444b515df88.png
9ed330a1c4fe65c442b829902e1d8855.png
79d34970d01b0d65a40c0824ad05934b.png